Linear Regression

Advance Analytics with R (UG 21-24)

Ayush Patel

Before we start

Please load the following packages

library(tidyverse)
library(MASS)
library(ISLR)
library(ISLR2)



Access the lecture slides at bit.ly/aar-ug

Warrior's armor(gusoku)
Source: Armor (Gusoku)

Hello

I am Ayush.

I am a researcher working at the intersection of data, law, development and economics.

I teach Data Science using R at Gokhale Institute of Politics and Economics

I am an RStudio (Posit) certified tidyverse instructor.

I am a researcher at the Oxford Poverty and Human Development Initiative (OPHI) at the University of Oxford.

Reach me

ayush.ap58@gmail.com

ayush.patel@gipe.ac.in

Learning Objective

Learn to apply and interpret simple and multiple linear regression models.

References for this lecture:

  • Chapter 3, ISLR (reference)
  • Chapters 7 and 8, Intro to Modern Statistics (Reading for intuitive understanding)

Advertising Data

...1 TV radio newspaper sales
1 230.1 37.8 69.2 22.1
2 44.5 39.3 45.1 10.4
3 17.2 45.9 69.3 9.3
4 151.5 41.3 58.5 18.5
5 180.8 10.8 58.4 12.9
6 8.7 48.9 75.0 7.2
7 57.5 32.8 23.5 11.8
8 120.2 19.6 11.6 13.2
9 8.6 2.1 1.0 4.8
10 199.8 2.6 21.2 10.6
11 66.1 5.8 24.2 8.6
12 214.7 24.0 4.0 17.4
13 23.8 35.1 65.9 9.2
14 97.5 7.6 7.2 9.7
15 204.1 32.9 46.0 19.0
16 195.4 47.7 52.9 22.4
17 67.8 36.6 114.0 12.5
18 281.4 39.6 55.8 24.4
19 69.2 20.5 18.3 11.3
20 147.3 23.9 19.1 14.6
21 218.4 27.7 53.4 18.0
22 237.4 5.1 23.5 12.5
23 13.2 15.9 49.6 5.6
24 228.3 16.9 26.2 15.5
25 62.3 12.6 18.3 9.7
26 262.9 3.5 19.5 12.0
27 142.9 29.3 12.6 15.0
28 240.1 16.7 22.9 15.9
29 248.8 27.1 22.9 18.9
30 70.6 16.0 40.8 10.5
31 292.9 28.3 43.2 21.4
32 112.9 17.4 38.6 11.9
33 97.2 1.5 30.0 9.6
34 265.6 20.0 0.3 17.4
35 95.7 1.4 7.4 9.5
36 290.7 4.1 8.5 12.8
37 266.9 43.8 5.0 25.4
38 74.7 49.4 45.7 14.7
39 43.1 26.7 35.1 10.1
40 228.0 37.7 32.0 21.5
41 202.5 22.3 31.6 16.6
42 177.0 33.4 38.7 17.1
43 293.6 27.7 1.8 20.7
44 206.9 8.4 26.4 12.9
45 25.1 25.7 43.3 8.5
46 175.1 22.5 31.5 14.9
47 89.7 9.9 35.7 10.6
48 239.9 41.5 18.5 23.2
49 227.2 15.8 49.9 14.8
50 66.9 11.7 36.8 9.7
51 199.8 3.1 34.6 11.4
52 100.4 9.6 3.6 10.7
53 216.4 41.7 39.6 22.6
54 182.6 46.2 58.7 21.2
55 262.7 28.8 15.9 20.2
56 198.9 49.4 60.0 23.7
57 7.3 28.1 41.4 5.5
58 136.2 19.2 16.6 13.2
59 210.8 49.6 37.7 23.8
60 210.7 29.5 9.3 18.4
61 53.5 2.0 21.4 8.1
62 261.3 42.7 54.7 24.2
63 239.3 15.5 27.3 15.7
64 102.7 29.6 8.4 14.0
65 131.1 42.8 28.9 18.0
66 69.0 9.3 0.9 9.3
67 31.5 24.6 2.2 9.5
68 139.3 14.5 10.2 13.4
69 237.4 27.5 11.0 18.9
70 216.8 43.9 27.2 22.3
71 199.1 30.6 38.7 18.3
72 109.8 14.3 31.7 12.4
73 26.8 33.0 19.3 8.8
74 129.4 5.7 31.3 11.0
75 213.4 24.6 13.1 17.0
76 16.9 43.7 89.4 8.7
77 27.5 1.6 20.7 6.9
78 120.5 28.5 14.2 14.2
79 5.4 29.9 9.4 5.3
80 116.0 7.7 23.1 11.0
81 76.4 26.7 22.3 11.8
82 239.8 4.1 36.9 12.3
83 75.3 20.3 32.5 11.3
84 68.4 44.5 35.6 13.6
85 213.5 43.0 33.8 21.7
86 193.2 18.4 65.7 15.2
87 76.3 27.5 16.0 12.0
88 110.7 40.6 63.2 16.0
89 88.3 25.5 73.4 12.9
90 109.8 47.8 51.4 16.7
91 134.3 4.9 9.3 11.2
92 28.6 1.5 33.0 7.3
93 217.7 33.5 59.0 19.4
94 250.9 36.5 72.3 22.2
95 107.4 14.0 10.9 11.5
96 163.3 31.6 52.9 16.9
97 197.6 3.5 5.9 11.7
98 184.9 21.0 22.0 15.5
99 289.7 42.3 51.2 25.4
100 135.2 41.7 45.9 17.2
101 222.4 4.3 49.8 11.7
102 296.4 36.3 100.9 23.8
103 280.2 10.1 21.4 14.8
104 187.9 17.2 17.9 14.7
105 238.2 34.3 5.3 20.7
106 137.9 46.4 59.0 19.2
107 25.0 11.0 29.7 7.2
108 90.4 0.3 23.2 8.7
109 13.1 0.4 25.6 5.3
110 255.4 26.9 5.5 19.8
111 225.8 8.2 56.5 13.4
112 241.7 38.0 23.2 21.8
113 175.7 15.4 2.4 14.1
114 209.6 20.6 10.7 15.9
115 78.2 46.8 34.5 14.6
116 75.1 35.0 52.7 12.6
117 139.2 14.3 25.6 12.2
118 76.4 0.8 14.8 9.4
119 125.7 36.9 79.2 15.9
120 19.4 16.0 22.3 6.6
121 141.3 26.8 46.2 15.5
122 18.8 21.7 50.4 7.0
123 224.0 2.4 15.6 11.6
124 123.1 34.6 12.4 15.2
125 229.5 32.3 74.2 19.7
126 87.2 11.8 25.9 10.6
127 7.8 38.9 50.6 6.6
128 80.2 0.0 9.2 8.8
129 220.3 49.0 3.2 24.7
130 59.6 12.0 43.1 9.7
131 0.7 39.6 8.7 1.6
132 265.2 2.9 43.0 12.7
133 8.4 27.2 2.1 5.7
134 219.8 33.5 45.1 19.6
135 36.9 38.6 65.6 10.8
136 48.3 47.0 8.5 11.6
137 25.6 39.0 9.3 9.5
138 273.7 28.9 59.7 20.8
139 43.0 25.9 20.5 9.6
140 184.9 43.9 1.7 20.7
141 73.4 17.0 12.9 10.9
142 193.7 35.4 75.6 19.2
143 220.5 33.2 37.9 20.1
144 104.6 5.7 34.4 10.4
145 96.2 14.8 38.9 11.4
146 140.3 1.9 9.0 10.3
147 240.1 7.3 8.7 13.2
148 243.2 49.0 44.3 25.4
149 38.0 40.3 11.9 10.9
150 44.7 25.8 20.6 10.1
151 280.7 13.9 37.0 16.1
152 121.0 8.4 48.7 11.6
153 197.6 23.3 14.2 16.6
154 171.3 39.7 37.7 19.0
155 187.8 21.1 9.5 15.6
156 4.1 11.6 5.7 3.2
157 93.9 43.5 50.5 15.3
158 149.8 1.3 24.3 10.1
159 11.7 36.9 45.2 7.3
160 131.7 18.4 34.6 12.9
161 172.5 18.1 30.7 14.4
162 85.7 35.8 49.3 13.3
163 188.4 18.1 25.6 14.9
164 163.5 36.8 7.4 18.0
165 117.2 14.7 5.4 11.9
166 234.5 3.4 84.8 11.9
167 17.9 37.6 21.6 8.0
168 206.8 5.2 19.4 12.2
169 215.4 23.6 57.6 17.1
170 284.3 10.6 6.4 15.0
171 50.0 11.6 18.4 8.4
172 164.5 20.9 47.4 14.5
173 19.6 20.1 17.0 7.6
174 168.4 7.1 12.8 11.7
175 222.4 3.4 13.1 11.5
176 276.9 48.9 41.8 27.0
177 248.4 30.2 20.3 20.2
178 170.2 7.8 35.2 11.7
179 276.7 2.3 23.7 11.8
180 165.6 10.0 17.6 12.6
181 156.6 2.6 8.3 10.5
182 218.5 5.4 27.4 12.2
183 56.2 5.7 29.7 8.7
184 287.6 43.0 71.8 26.2
185 253.8 21.3 30.0 17.6
186 205.0 45.1 19.6 22.6
187 139.5 2.1 26.6 10.3
188 191.1 28.7 18.2 17.3
189 286.0 13.9 3.7 15.9
190 18.7 12.1 23.4 6.7
191 39.5 41.1 5.8 10.8
192 75.5 10.8 6.0 9.9
193 17.2 4.1 31.6 5.9
194 166.8 42.0 3.6 19.6
195 149.7 35.6 6.0 17.3
196 38.2 3.7 13.8 7.6
197 94.2 4.9 8.1 9.7
198 177.0 9.3 6.4 12.8
199 283.6 42.0 66.2 25.5
200 232.1 8.6 8.7 13.4

Association between sales and budget?

How strong is the association, if any?

sales and TV

[1] 0.7822244

sales and radio

[1] 0.5762226

sales and newspaper

[1] 0.228299
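
The correlations above can be reproduced with `cor()`. The following is a minimal sketch that uses only the first ten rows of the table, typed in by hand under the assumed name `advertisement_head`, so its values will not match the full-data numbers printed above.

```r
# First ten rows of the Advertising table shown above.
# The full data has 200 rows, so these correlations differ from the slide values.
advertisement_head <- data.frame(
  TV        = c(230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2, 8.6, 199.8),
  radio     = c(37.8, 39.3, 45.9, 41.3, 10.8, 48.9, 32.8, 19.6, 2.1, 2.6),
  newspaper = c(69.2, 45.1, 69.3, 58.5, 58.4, 75.0, 23.5, 11.6, 1.0, 21.2),
  sales     = c(22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8, 10.6)
)

# Pearson correlation of sales with each advertising budget
cors <- sapply(
  advertisement_head[c("TV", "radio", "newspaper")],
  function(x) cor(advertisement_head$sales, x)
)
cors

# The same quantity by hand for TV: covariance over the product of standard deviations
cor_tv_manual <- with(advertisement_head, cov(sales, TV) / (sd(sales) * sd(TV)))
cor_tv_manual
```

With the full data loaded as `advertisement`, the same call is `cor(advertisement$sales, advertisement$TV)`.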

Linear model

A linear model can help us answer several questions: Is there an association between the response and the predictors? How strong is it? Is the relationship linear? Do the predictors interact? It also lets us predict future sales.

A simple linear model

\[ Y \approx \beta_0 + \beta_1X \]

\[\beta_0 \text{ is the population intercept}\]

\[\beta_1 \text{ is the population slope}\]

Our estimates are represented as:

\[ \hat\beta_0 \qquad \hat\beta_1 \]

How to reach the best estimate?

The idea is, essentially, to draw a line through the points such that the distance of every point from the line is as small as possible.

Least squares

One way to estimate the population coefficients (parameters) is least squares: choose the estimates that minimize the residual sum of squares (RSS).

\[sales \approx \beta_0 + \beta_1*TV\]

\[\hat y_i = \hat\beta_0 + \hat\beta_1x_i\] \[e_i = y_i - \hat y_i\]

\[RSS = e_1^2 + e_2^2 + \dots + e_n^2\]

Minimize RSS

Least squares coefficient estimates

\[ \hat\beta_1 = \frac{\sum_{i=1}^n(x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n(x_i - \bar x)^2} \]

\[ \hat\beta_0 = \bar y - \hat\beta_1\bar x \]
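
As a quick check of these formulas, the sketch below computes both estimates by hand and compares them with `lm()`. It uses only the first ten TV and sales values from the table, so the coefficients will differ from the full-data fit; the vector names are chosen for this illustration.

```r
# TV budget and sales for the first ten rows of the Advertising table
tv    <- c(230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2, 8.6, 199.8)
sales <- c(22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8, 10.6)

# Closed-form least squares estimates
beta1_hat <- sum((tv - mean(tv)) * (sales - mean(sales))) / sum((tv - mean(tv))^2)
beta0_hat <- mean(sales) - beta1_hat * mean(tv)

# Residual sum of squares at these estimates
rss <- sum((sales - (beta0_hat + beta1_hat * tv))^2)

# lm() minimizes the same RSS, so it should return the same coefficients
fit <- lm(sales ~ tv)
c(beta0_hat, beta1_hat)
coef(fit)
```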

The model


Call:
lm(formula = sales ~ TV, data = advertisement)

Coefficients:
(Intercept)           TV  
    7.03259      0.04754  

“For every additional $1000 spent on the TV advertising budget, there is an additional sale of ~47.5 units.”

Exercise

Use the Auto data from the {ISLR2} package. Fit this model:

\[horsepower = \beta_0 + \beta_1*weight + \epsilon\]

Find the coefficient estimates \(\hat\beta_0\) and \(\hat\beta_1\), and the residuals.

How well did we estimate the coefficients?

Compute the standard errors of \(\hat\beta_0\) and \(\hat\beta_1\).

This is analogous to the standard error of a sample mean \(\hat\mu\):

\[Var(\hat\mu) = SE(\hat\mu)^2 = \frac{\sigma^2}{n}\]

For the regression coefficients:

\[SE(\hat\beta_0)^2 = \sigma^2\left[\frac{1}{n}+\frac{\bar x^2}{\sum_{i=1}^n(x_i - \bar x)^2}\right] \qquad SE(\hat\beta_1)^2 = \frac{\sigma^2}{\sum_{i=1}^n(x_i - \bar x)^2}\]

What is \(\sigma\) here?

What happens when the \(x_i\) are spread out?

We can use the SE for hypothesis testing. In practice, the t-statistic is used to do this.

\[t = \frac{\hat\beta_1 - 0}{SE(\hat\beta_1)}\]


Call:
lm(formula = sales ~ TV, data = advertisement)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.3860 -1.9545 -0.1913  2.0671  7.2124 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 7.032594   0.457843   15.36   <2e-16 ***
TV          0.047537   0.002691   17.67   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.259 on 198 degrees of freedom
Multiple R-squared:  0.6119,    Adjusted R-squared:  0.6099 
F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16
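
The standard errors and t-statistic in a `summary()` table can be rebuilt from the formulas above. The sketch below does this on simulated data (the seed, sample size and true coefficients are arbitrary choices for illustration, not the Advertising data), replacing \(\sigma\) with its estimate, the residual standard error.

```r
set.seed(1)                          # simulated data, for illustration only
n <- 50
x <- runif(n, 0, 300)
y <- 7 + 0.05 * x + rnorm(n, sd = 3)

fit <- lm(y ~ x)
rss <- sum(resid(fit)^2)
sigma_hat <- sqrt(rss / (n - 2))     # residual standard error, estimates sigma

# Standard errors from the formulas on the slide
sxx <- sum((x - mean(x))^2)
se_beta0 <- sigma_hat * sqrt(1 / n + mean(x)^2 / sxx)
se_beta1 <- sigma_hat / sqrt(sxx)

# t-statistic and two-sided p-value for H0: beta_1 = 0
t_beta1 <- coef(fit)[["x"]] / se_beta1
p_beta1 <- 2 * pt(abs(t_beta1), df = n - 2, lower.tail = FALSE)

# Compare with the coefficient table from summary()
coef(summary(fit))
c(se_beta0 = se_beta0, se_beta1 = se_beta1, t_beta1 = t_beta1, p_beta1 = p_beta1)
```

Note that `sxx` sits in the denominator of both standard errors: the more spread out the \(x_i\), the smaller the standard errors.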

Assessing Model Accuracy

RSE and \(R^2\)

mod <- summary(lm(sales ~ TV, data = advertisement))
mod$sigma
[1] 3.258656
mod$sigma/mean(advertisement$sales)
[1] 0.2323877
mod$r.squared
[1] 0.6118751

Residual Standard Error (RSE) is an estimate of the standard deviation of \(\epsilon\). Essentially \(\sqrt{\frac{RSS}{n-2}}\)

This is a measure of lack of fit.

It is measured in the units of Y, so its size depends on the scale of Y (beware).

\(R^2\) is the proportion of variance in the response that is explained by the model.

TSS is the total sum of squares, \(\sum_{i=1}^n(y_i - \bar y)^2\)

\[R^2 = \frac{TSS-RSS}{TSS}\]

\(R^2\) is always between 0 and 1 and does not depend on the scale of Y, so it is easier to interpret.
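
Both measures can be computed directly from RSS and TSS. The sketch below uses simulated data (an assumption made so the example runs on its own) and checks the hand-computed values against `summary()`.

```r
set.seed(2)                          # simulated data, for illustration only
n <- 60
x <- runif(n, 0, 300)
y <- 7 + 0.05 * x + rnorm(n, sd = 3)
fit <- lm(y ~ x)

rss <- sum(resid(fit)^2)             # residual sum of squares
tss <- sum((y - mean(y))^2)          # total sum of squares

rse <- sqrt(rss / (n - 2))           # residual standard error
r2  <- (tss - rss) / tss             # proportion of variance explained

c(rse = rse, r2 = r2)
c(summary(fit)$sigma, summary(fit)$r.squared)

# RSE relative to the mean response gives a scale-free sense of the error
rse / mean(y)
```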

Multiple Linear Regression

  • In reality, more than one predictor can influence or be associated with the response.
  • Run a simple linear regression for each predictor? Why not? (We cannot make a single prediction of sales from three separate models; each model ignores the other two media; and the predictors could be correlated.)

Multiple Linear Regression

mod <- lm(sales ~ TV + radio + newspaper,
   data = advertisement)

broom::tidy(mod)
# A tibble: 4 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  2.94      0.312       9.42  1.27e-17
2 TV           0.0458    0.00139    32.8   1.51e-81
3 radio        0.189     0.00861    21.9   1.51e-54
4 newspaper   -0.00104   0.00587    -0.177 8.60e- 1
broom::glance(mod)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.897         0.896  1.69      570. 1.58e-96     3  -386.  782.  799.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
broom::augment(mod)
# A tibble: 200 × 10
   sales    TV radio newspaper .fitted  .resid    .hat .sigma .cooksd .std.resid
   <dbl> <dbl> <dbl>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>      <dbl>
 1  22.1 230.   37.8      69.2   20.5   1.58   0.0252    1.69 5.80e-3     0.947 
 2  10.4  44.5  39.3      45.1   12.3  -1.94   0.0194    1.68 6.67e-3    -1.16  
 3   9.3  17.2  45.9      69.3   12.3  -3.01   0.0392    1.68 3.38e-2    -1.82  
 4  18.5 152.   41.3      58.5   17.6   0.902  0.0166    1.69 1.23e-3     0.540 
 5  12.9 181.   10.8      58.4   13.2  -0.289  0.0235    1.69 1.81e-4    -0.173 
 6   7.2   8.7  48.9      75     12.5  -5.28   0.0475    1.64 1.28e-1    -3.21  
 7  11.8  57.5  32.8      23.5   11.7   0.0702 0.0144    1.69 6.45e-6     0.0420
 8  13.2 120.   19.6      11.6   12.1   1.08   0.00918   1.69 9.55e-4     0.642 
 9   4.8   8.6   2.1       1      3.73  1.07   0.0307    1.69 3.31e-3     0.646 
10  10.6 200.    2.6      21.2   12.6  -1.95   0.0171    1.68 5.95e-3    -1.17  
# ℹ 190 more rows

What is going on with newspaper?

  • We saw that sales and newspaper have a weak positive correlation (~0.23).
  • Then why is \(\hat\beta_{newspaper}\) negative?
  • What do you think about its standard error?
  • What is the coefficient and SE when you execute lm(sales ~ newspaper, data = advertisement)? Why is this so different?
lm(sales~newspaper, data = advertisement)|>
  summary()|>
  broom::tidy()
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  12.4       0.621      19.9  4.71e-49
2 newspaper     0.0547    0.0166      3.30 1.15e- 3

Fishy business

The F-statistic

\[ F = \frac{(TSS-RSS)/p}{RSS/(n-p-1)} \]

In a multiple linear regression we first check whether at least one of the \(p\) predictors is associated with the response. The null hypothesis is \(H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0\); the alternative is that at least one \(\beta_j\) is non-zero. The F-statistic tests this.

If \(H_0\) is true, F will be close to 1; if not, F will be considerably larger than 1.
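
The F-statistic can be computed from TSS and RSS exactly as in the formula. The sketch below uses the built-in `mtcars` data as a stand-in (an assumption, chosen only because it ships with R) with \(p = 3\) predictors.

```r
# Multiple regression on a built-in dataset, used here only as a stand-in
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

n <- nrow(mtcars)                    # number of observations
p <- 3                               # number of predictors

rss <- sum(resid(fit)^2)
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

# F-statistic and its p-value under H0: all slope coefficients are zero
f_stat  <- ((tss - rss) / p) / (rss / (n - p - 1))
p_value <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

c(f_stat = f_stat, p_value = p_value)
summary(fit)$fstatistic              # value, numerator df, denominator df
```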

Variable Selection

If one or more predictors have non-zero coefficients and are associated with the response, which ones are they?

  • We can begin by trying out various subsets of predictors and assessing all possible models. (Can you try out \(2^p\) models?)
  • Models can be compared using Mallows's \(C_p\), the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and adjusted \(R^2\).
  • When \(2^p\) is too large to try every model, use forward selection or backward selection.
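
One way to try forward and backward selection in base R is `step()`, which adds or drops one term at a time using AIC. The sketch below runs it on the built-in `mtcars` data; the dataset and the four candidate predictors are assumptions made for illustration.

```r
# Candidate predictors are an arbitrary choice for this illustration
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
null <- lm(mpg ~ 1, data = mtcars)   # intercept-only model

# Backward selection: start from the full model and drop terms while AIC improves
backward <- step(full, direction = "backward", trace = 0)

# Forward selection: start from the null model and add terms from the full formula
forward <- step(null, scope = formula(full), direction = "forward", trace = 0)

formula(backward)
formula(forward)

# Compare the candidates on AIC and adjusted R^2
AIC(full, backward, forward)
c(full = summary(full)$adj.r.squared, backward = summary(backward)$adj.r.squared)
```

The two directions need not arrive at the same model, which is one reason to compare the resulting fits rather than trust either path blindly.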

Qualitative Variables - dummies

Income Limit Rating Cards Age Education Own Student Married Region Balance
14.891 3606 283 2 34 11 No No Yes South 333
106.025 6645 483 3 82 15 Yes Yes Yes West 903
104.593 7075 514 4 71 11 No No No West 580
148.924 9504 681 3 36 11 Yes No No West 964
55.882 4897 357 2 68 16 No No Yes South 331
80.180 8047 569 4 77 10 No No No South 1151
20.996 3388 259 2 37 12 Yes No No East 203
71.408 7114 512 2 87 9 No No No West 872
15.125 3300 266 5 66 13 Yes No No South 279
71.061 6819 491 3 41 19 Yes Yes Yes East 1350
63.095 8117 589 4 30 14 No No Yes South 1407
15.045 1311 138 3 64 16 No No No South 0
80.616 5308 394 1 57 7 Yes No Yes West 204
43.682 6922 511 1 49 9 No No Yes South 1081
19.144 3291 269 2 75 13 Yes No No East 148
20.089 2525 200 3 57 15 Yes No Yes East 0
53.598 3714 286 3 73 17 Yes No Yes East 0
36.496 4378 339 3 69 15 Yes No Yes West 368
49.570 6384 448 1 28 9 Yes No Yes West 891
42.079 6626 479 2 44 9 No No No West 1048
17.700 2860 235 4 63 16 Yes No No West 89
37.348 6378 458 1 72 17 Yes No No South 968
20.103 2631 213 3 61 10 No No Yes East 0
64.027 5179 398 5 48 8 No No Yes East 411
10.742 1757 156 3 57 15 Yes No No South 0
14.090 4323 326 5 25 16 Yes No Yes East 671
42.471 3625 289 6 44 12 Yes Yes No South 654
32.793 4534 333 2 44 16 No No No East 467
186.634 13414 949 2 41 14 Yes No Yes East 1809
26.813 5611 411 4 55 16 Yes No No South 915
34.142 5666 413 4 47 5 Yes No Yes South 863
28.941 2733 210 5 43 16 No No Yes West 0
134.181 7838 563 2 48 13 Yes No No South 526
31.367 1829 162 4 30 10 No No Yes South 0
20.150 2646 199 2 25 14 Yes No Yes West 0
23.350 2558 220 3 49 12 Yes Yes No South 419
62.413 6457 455 2 71 11 Yes No Yes South 762
30.007 6481 462 2 69 9 Yes No Yes South 1093
11.795 3899 300 4 25 10 Yes No No South 531
13.647 3461 264 4 47 14 No No Yes South 344
34.950 3327 253 3 54 14 Yes No No East 50
113.659 7659 538 2 66 15 No Yes Yes East 1155
44.158 4763 351 2 66 13 Yes No Yes West 385
36.929 6257 445 1 24 14 Yes No Yes West 976
31.861 6375 469 3 25 16 Yes No Yes South 1120
77.380 7569 564 3 50 12 Yes No Yes South 997
19.531 5043 376 2 64 16 Yes Yes Yes West 1241
44.646 4431 320 2 49 15 No Yes Yes South 797
44.522 2252 205 6 72 15 No No Yes West 0
43.479 4569 354 4 49 13 No Yes Yes East 902
36.362 5183 376 3 49 15 No No Yes East 654
39.705 3969 301 2 27 20 No No Yes East 211
44.205 5441 394 1 32 12 No No Yes South 607
16.304 5466 413 4 66 10 No No Yes West 957
15.333 1499 138 2 47 9 Yes No Yes West 0
32.916 1786 154 2 60 8 Yes No Yes West 0
57.100 4742 372 7 79 18 Yes No Yes West 379
76.273 4779 367 4 65 14 Yes No Yes South 133
10.354 3480 281 2 70 17 No No Yes South 333
51.872 5294 390 4 81 17 Yes No No South 531
35.510 5198 364 2 35 20 Yes No No West 631
21.238 3089 254 3 59 10 Yes No No South 108
30.682 1671 160 2 77 7 Yes No No South 0
14.132 2998 251 4 75 17 No No No South 133
32.164 2937 223 2 79 15 Yes No Yes East 0
12.000 4160 320 4 28 14 Yes No Yes South 602
113.829 9704 694 4 38 13 Yes No Yes West 1388
11.187 5099 380 4 69 16 Yes No No East 889
27.847 5619 418 2 78 15 Yes No Yes South 822
49.502 6819 505 4 55 14 No No Yes South 1084
24.889 3954 318 4 75 12 No No Yes South 357
58.781 7402 538 2 81 12 Yes No Yes West 1103
22.939 4923 355 1 47 18 Yes No Yes West 663
23.989 4523 338 4 31 15 No No No South 601
16.103 5390 418 4 45 10 Yes No Yes South 945
33.017 3180 224 2 28 16 No No Yes East 29
30.622 3293 251 1 68 16 No Yes No South 532
20.936 3254 253 1 30 15 Yes No No West 145
110.968 6662 468 3 45 11 Yes No Yes South 391
15.354 2101 171 2 65 14 No No No West 0
27.369 3449 288 3 40 9 Yes No Yes South 162
53.480 4263 317 1 83 15 No No No South 99
23.672 4433 344 3 63 11 No No No South 503
19.225 1433 122 3 38 14 Yes No No South 0
43.540 2906 232 4 69 11 No No No South 0
152.298 12066 828 4 41 12 Yes No Yes West 1779
55.367 6340 448 1 33 15 No No Yes South 815
11.741 2271 182 4 59 12 Yes No No West 0
15.560 4307 352 4 57 8 No No Yes East 579
59.530 7518 543 3 52 9 Yes No No East 1176
20.191 5767 431 4 42 16 No No Yes East 1023
48.498 6040 456 3 47 16 No No Yes South 812
30.733 2832 249 4 51 13 No No No South 0
16.479 5435 388 2 26 16 No No No East 937
38.009 3075 245 3 45 15 Yes No No East 0
14.084 855 120 5 46 17 Yes No Yes East 0
14.312 5382 367 1 59 17 No Yes No West 1380
26.067 3388 266 4 74 17 Yes No Yes East 155
36.295 2963 241 2 68 14 Yes Yes No East 375
83.851 8494 607 5 47 18 No No No South 1311
21.153 3736 256 1 41 11 No No No South 298
17.976 2433 190 3 70 16 Yes Yes No South 431
68.713 7582 531 2 56 16 No Yes No South 1587
146.183 9540 682 6 66 15 No No No South 1050
15.846 4768 365 4 53 12 Yes No No South 745
12.031 3182 259 2 58 18 Yes No Yes South 210
16.819 1337 115 2 74 15 No No Yes West 0
39.110 3189 263 3 72 12 No No No West 0
107.986 6033 449 4 64 14 No No Yes South 227
13.561 3261 279 5 37 19 No No Yes West 297
34.537 3271 250 3 57 17 Yes No Yes West 47
28.575 2959 231 2 60 11 Yes No No East 0
46.007 6637 491 4 42 14 No No Yes South 1046
69.251 6386 474 4 30 12 Yes No Yes West 768
16.482 3326 268 4 41 15 No No No South 271
40.442 4828 369 5 81 8 Yes No No East 510
35.177 2117 186 3 62 16 Yes No No South 0
91.362 9113 626 1 47 17 No No Yes West 1341
27.039 2161 173 3 40 17 Yes No No South 0
23.012 1410 137 3 81 16 No No No South 0
27.241 1402 128 2 67 15 Yes No Yes West 0
148.080 8157 599 2 83 13 No No Yes South 454
62.602 7056 481 1 84 11 Yes No No South 904
11.808 1300 117 3 77 14 Yes No No East 0
29.564 2529 192 1 30 12 Yes No Yes South 0
27.578 2531 195 1 34 15 Yes No Yes South 0
26.427 5533 433 5 50 15 Yes Yes Yes West 1404
57.202 3411 259 3 72 11 Yes No No South 0
123.299 8376 610 2 89 17 No Yes No East 1259
18.145 3461 279 3 56 15 No No Yes East 255
23.793 3821 281 4 56 12 Yes Yes Yes East 868
10.726 1568 162 5 46 19 No No Yes West 0
23.283 5443 407 4 49 13 No No Yes East 912
21.455 5829 427 4 80 12 Yes No Yes East 1018
34.664 5835 452 3 77 15 Yes No Yes East 835
44.473 3500 257 3 81 16 Yes No No East 8
54.663 4116 314 2 70 8 Yes No No East 75
36.355 3613 278 4 35 9 No No Yes West 187
21.374 2073 175 2 74 11 Yes No Yes South 0
107.841 10384 728 3 87 7 No No No East 1597
39.831 6045 459 3 32 12 Yes Yes Yes East 1425
91.876 6754 483 2 33 10 No No Yes South 605
103.893 7416 549 3 84 17 No No No West 669
19.636 4896 387 3 64 10 Yes No No East 710
17.392 2748 228 3 32 14 No No Yes South 68
19.529 4673 341 2 51 14 No No No West 642
17.055 5110 371 3 55 15 Yes No Yes South 805
23.857 1501 150 3 56 16 No No Yes South 0
15.184 2420 192 2 69 11 Yes No Yes South 0
13.444 886 121 5 44 10 No No Yes West 0
63.931 5728 435 3 28 14 Yes No Yes East 581
35.864 4831 353 3 66 13 Yes No Yes South 534
41.419 2120 184 4 24 11 Yes Yes No South 156
92.112 4612 344 3 32 17 No No No South 0
55.056 3155 235 2 31 16 No No Yes East 0
19.537 1362 143 4 34 9 Yes No Yes West 0
31.811 4284 338 5 75 13 Yes No Yes South 429
56.256 5521 406 2 72 16 Yes Yes Yes South 1020
42.357 5550 406 2 83 12 Yes No Yes West 653
53.319 3000 235 3 53 13 No No No West 0
12.238 4865 381 5 67 11 Yes No No South 836
31.353 1705 160 3 81 14 No No Yes South 0
63.809 7530 515 1 56 12 No No Yes South 1086
13.676 2330 203 5 80 16 Yes No No East 0
76.782 5977 429 4 44 12 No No Yes West 548
25.383 4527 367 4 46 11 No No Yes South 570
35.691 2880 214 2 35 15 No No No East 0
29.403 2327 178 1 37 14 Yes No Yes South 0
27.470 2820 219 1 32 11 Yes No Yes West 0
27.330 6179 459 4 36 12 Yes No Yes South 1099
34.772 2021 167 3 57 9 No No No West 0
36.934 4270 299 1 63 9 Yes No Yes South 283
76.348 4697 344 4 60 18 No No No West 108
14.887 4745 339 3 58 12 No No Yes East 724
121.834 10673 750 3 54 16 No No No East 1573
30.132 2168 206 3 52 17 No No No South 0
24.050 2607 221 4 32 18 No No Yes South 0
22.379 3965 292 2 34 14 Yes No Yes West 384
28.316 4391 316 2 29 10 Yes No No South 453
58.026 7499 560 5 67 11 Yes No No South 1237
10.635 3584 294 5 69 16 No No Yes West 423
46.102 5180 382 3 81 12 No No Yes East 516
58.929 6420 459 2 66 9 Yes No Yes East 789
80.861 4090 335 3 29 15 Yes No Yes West 0
158.889 11589 805 1 62 17 Yes No Yes South 1448
30.420 4442 316 1 30 14 Yes No No East 450
36.472 3806 309 2 52 13 No No No East 188
23.365 2179 167 2 75 15 No No No West 0
83.869 7667 554 2 83 11 No No No East 930
58.351 4411 326 2 85 16 Yes No Yes South 126
55.187 5352 385 4 50 17 Yes No Yes South 538
124.290 9560 701 3 52 17 Yes Yes No West 1687
28.508 3933 287 4 56 14 No No Yes West 336
130.209 10088 730 7 39 19 Yes No Yes South 1426
30.406 2120 181 2 79 14 No No Yes East 0
23.883 5384 398 2 73 16 Yes No Yes East 802
93.039 7398 517 1 67 12 No No Yes East 749
50.699 3977 304 2 84 17 Yes No No East 69
27.349 2000 169 4 51 16 Yes No Yes East 0
10.403 4159 310 3 43 7 No No Yes West 571
23.949 5343 383 2 40 18 No No Yes East 829
73.914 7333 529 6 67 15 Yes No Yes South 1048
21.038 1448 145 2 58 13 Yes No Yes South 0
68.206 6784 499 5 40 16 Yes Yes No East 1411
57.337 5310 392 2 45 7 Yes No No South 456
10.793 3878 321 8 29 13 No No No South 638
23.450 2450 180 2 78 13 No No No South 0
10.842 4391 358 5 37 10 Yes Yes Yes South 1216
51.345 4327 320 3 46 15 No No No East 230
151.947 9156 642 2 91 11 Yes No Yes East 732
24.543 3206 243 2 62 12 Yes No Yes South 95
29.567 5309 397 3 25 15 No No No South 799
39.145 4351 323 2 66 13 No No Yes South 308
39.422 5245 383 2 44 19 No No No East 637
34.909 5289 410 2 62 16 Yes No Yes South 681
41.025 4229 337 3 79 19 Yes No Yes South 246
15.476 2762 215 3 60 18 No No No West 52
12.456 5395 392 3 65 14 No No Yes South 955
10.627 1647 149 2 71 10 Yes Yes Yes West 195
38.954 5222 370 4 76 13 Yes No No South 653
44.847 5765 437 3 53 13 Yes Yes No West 1246
98.515 8760 633 5 78 11 Yes No No East 1230
33.437 6207 451 4 44 9 No Yes No South 1549
27.512 4613 344 5 72 17 No No Yes West 573
121.709 7818 584 4 50 6 No No Yes South 701
15.079 5673 411 4 28 15 Yes No Yes West 1075
59.879 6906 527 6 78 15 Yes No No South 1032
66.989 5614 430 3 47 14 Yes No Yes South 482
69.165 4668 341 2 34 11 Yes No No East 156
69.943 7555 547 3 76 9 No No Yes West 1058
33.214 5137 387 3 59 9 No No No East 661
25.124 4776 378 4 29 12 No No Yes South 657
15.741 4788 360 1 39 14 No No Yes West 689
11.603 2278 187 3 71 11 No No Yes South 0
69.656 8244 579 3 41 14 No No Yes East 1329
10.503 2923 232 3 25 18 Yes No Yes East 191
42.529 4986 369 2 37 11 No No Yes West 489
60.579 5149 388 5 38 15 No No Yes West 443
26.532 2910 236 6 58 19 Yes No Yes South 52
27.952 3557 263 1 35 13 Yes No Yes West 163
29.705 3351 262 5 71 14 Yes No Yes West 148
15.602 906 103 2 36 11 No No Yes East 0
20.918 1233 128 3 47 18 Yes Yes Yes West 16
58.165 6617 460 1 56 12 Yes No Yes South 856
22.561 1787 147 4 66 15 Yes No No South 0
34.509 2001 189 5 80 18 Yes No Yes East 0
19.588 3211 265 4 59 14 Yes No No West 199
36.364 2220 188 3 50 19 No No No South 0
15.717 905 93 1 38 16 No Yes Yes South 0
22.574 1551 134 3 43 13 Yes Yes Yes South 98
10.363 2430 191 2 47 18 Yes No Yes West 0
28.474 3202 267 5 66 12 No No Yes South 132
72.945 8603 621 3 64 8 Yes No No South 1355
85.425 5182 402 6 60 12 No No Yes East 218
36.508 6386 469 4 79 6 Yes No Yes South 1048
58.063 4221 304 3 50 8 No No No East 118
25.936 1774 135 2 71 14 Yes No No West 0
15.629 2493 186 1 60 14 No No Yes West 0
41.400 2561 215 2 36 14 No No Yes South 0
33.657 6196 450 6 55 9 Yes No No South 1092
67.937 5184 383 4 63 12 No No Yes West 345
180.379 9310 665 3 67 8 Yes Yes Yes West 1050
10.588 4049 296 1 66 13 Yes No Yes South 465
29.725 3536 270 2 52 15 Yes No No East 133
27.999 5107 380 1 55 10 No No Yes South 651
40.885 5013 379 3 46 13 Yes No Yes East 549
88.830 4952 360 4 86 16 Yes No Yes South 15
29.638 5833 433 3 29 15 Yes No Yes West 942
25.988 1349 142 4 82 12 No No No South 0
39.055 5565 410 4 48 18 Yes No Yes South 772
15.866 3085 217 1 39 13 No No No South 136
44.978 4866 347 1 30 10 Yes No No South 436
30.413 3690 299 2 25 15 Yes Yes No West 728
16.751 4706 353 6 48 14 No Yes No West 1255
30.550 5869 439 5 81 9 Yes No No East 967
163.329 8732 636 3 50 14 No No Yes South 529
23.106 3476 257 2 50 15 Yes No No South 209
41.532 5000 353 2 50 12 No No Yes South 531
128.040 6982 518 2 78 11 Yes No Yes South 250
54.319 3063 248 3 59 8 Yes Yes No South 269
53.401 5319 377 3 35 12 Yes No No East 541
36.142 1852 183 3 33 13 Yes No No East 0
63.534 8100 581 2 50 17 Yes No Yes South 1298
49.927 6396 485 3 75 17 Yes No Yes South 890
14.711 2047 167 2 67 6 No No Yes South 0
18.967 1626 156 2 41 11 Yes No Yes West 0
18.036 1552 142 2 48 15 Yes No No South 0
60.449 3098 272 4 69 8 No No Yes South 0
16.711 5274 387 3 42 16 Yes No Yes West 863
10.852 3907 296 2 30 9 No No No South 485
26.370 3235 268 5 78 11 No No Yes West 159
24.088 3665 287 4 56 13 Yes No Yes South 309
51.532 5096 380 2 31 15 No No Yes South 481
140.672 11200 817 7 46 9 No No Yes East 1677
42.915 2532 205 4 42 13 No No Yes West 0
27.272 1389 149 5 67 10 Yes No Yes South 0
65.896 5140 370 1 49 17 Yes No Yes South 293
55.054 4381 321 3 74 17 No No Yes West 188
20.791 2672 204 1 70 18 Yes No No East 0
24.919 5051 372 3 76 11 Yes No Yes East 711
21.786 4632 355 1 50 17 No No Yes South 580
31.335 3526 289 3 38 7 Yes No No South 172
59.855 4964 365 1 46 13 Yes No Yes South 295
44.061 4970 352 1 79 11 No No Yes East 414
82.706 7506 536 2 64 13 Yes No Yes West 905
24.460 1924 165 2 50 14 Yes No Yes West 0
45.120 3762 287 3 80 8 No No Yes South 70
75.406 3874 298 3 41 14 Yes No Yes West 0
14.956 4640 332 2 33 6 No No No West 681
75.257 7010 494 3 34 18 Yes No Yes South 885
33.694 4891 369 1 52 16 No Yes No East 1036
23.375 5429 396 3 57 15 Yes No Yes South 844
27.825 5227 386 6 63 11 No No Yes South 823
92.386 7685 534 2 75 18 Yes No Yes West 843
115.520 9272 656 2 69 14 No No No East 1140
14.479 3907 296 3 43 16 No No Yes South 463
52.179 7306 522 2 57 14 No No No West 1142
68.462 4712 340 2 71 16 No No Yes South 136
18.951 1485 129 3 82 13 Yes No No South 0
27.590 2586 229 5 54 16 No No Yes East 0
16.279 1160 126 3 78 13 No Yes Yes East 5
25.078 3096 236 2 27 15 Yes No Yes South 81
27.229 3484 282 6 51 11 No No No South 265
182.728 13913 982 4 98 17 No No Yes South 1999
31.029 2863 223 2 66 17 No Yes Yes West 415
17.765 5072 364 1 66 12 Yes No Yes South 732
125.480 10230 721 3 82 16 No No Yes South 1361
49.166 6662 508 3 68 14 Yes No No West 984
41.192 3673 297 3 54 16 Yes No Yes South 121
94.193 7576 527 2 44 16 Yes No Yes South 846
20.405 4543 329 2 72 17 No Yes No West 1054
12.581 3976 291 2 48 16 No No Yes South 474
62.328 5228 377 3 83 15 No No No South 380
21.011 3402 261 2 68 17 No No Yes East 182
24.230 4756 351 2 64 15 Yes No Yes South 594
24.314 3409 270 2 23 7 Yes No Yes South 194
32.856 5884 438 4 68 13 No No No South 926
12.414 855 119 3 32 12 No No Yes East 0
41.365 5303 377 1 45 14 No No No South 606
149.316 10278 707 1 80 16 No No No East 1107
27.794 3807 301 4 35 8 Yes No Yes East 320
13.234 3922 299 2 77 17 Yes No Yes South 426
14.595 2955 260 5 37 9 No No Yes East 204
10.735 3746 280 2 44 17 Yes No Yes South 410
48.218 5199 401 7 39 10 No No Yes West 633
30.012 1511 137 2 33 17 No No Yes South 0
21.551 5380 420 5 51 18 No No Yes West 907
160.231 10748 754 2 69 17 No No No South 1192
13.433 1134 112 3 70 14 No No Yes South 0
48.577 5145 389 3 71 13 Yes No Yes West 503
30.002 1561 155 4 70 13 Yes No Yes South 0
61.620 5140 374 1 71 9 No No Yes South 302
104.483 7140 507 2 41 14 No No Yes East 583
41.868 4716 342 2 47 18 No No No South 425
12.068 3873 292 1 44 18 Yes No Yes West 413
180.682 11966 832 2 58 8 Yes No Yes East 1405
34.480 6090 442 3 36 14 No No No South 962
39.609 2539 188 1 40 14 No No Yes West 0
30.111 4336 339 1 81 18 No No Yes South 347
12.335 4471 344 3 79 12 No No Yes East 611
53.566 5891 434 4 82 10 Yes No No South 712
53.217 4943 362 2 46 16 Yes No Yes West 382
26.162 5101 382 3 62 19 Yes No No East 710
64.173 6127 433 1 80 10 No No Yes South 578
128.669 9824 685 3 67 16 No No Yes West 1243
113.772 6442 489 4 69 15 No Yes Yes South 790
61.069 7871 564 3 56 14 No No Yes South 1264
23.793 3615 263 2 70 14 No No No East 216
89.000 5759 440 3 37 6 Yes No No South 345
71.682 8028 599 3 57 16 No No Yes South 1208
35.610 6135 466 4 40 12 No No No South 992
39.116 2150 173 4 75 15 No No No South 0
19.782 3782 293 2 46 16 Yes Yes No South 840
55.412 5354 383 2 37 16 Yes Yes Yes South 1003
29.400 4840 368 3 76 18 Yes No Yes South 588
20.974 5673 413 5 44 16 Yes No Yes South 1000
87.625 7167 515 2 46 10 Yes No No East 767
28.144 1567 142 3 51 10 No No Yes South 0
19.349 4941 366 1 33 19 No No Yes South 717
53.308 2860 214 1 84 10 No No Yes South 0
115.123 7760 538 3 83 14 Yes No No East 661
101.788 8029 574 2 84 11 No No Yes South 849
24.824 5495 409 1 33 9 No Yes No South 1352
14.292 3274 282 9 64 9 No No Yes South 382
20.088 1870 180 3 76 16 No No No East 0
26.400 5640 398 3 58 15 Yes No No West 905
19.253 3683 287 4 57 10 No No No East 371
16.529 1357 126 3 62 9 No No No West 0
37.878 6827 482 2 80 13 Yes No No South 1129
83.948 7100 503 2 44 18 No No No South 806
135.118 10578 747 3 81 15 Yes No Yes West 1393
73.327 6555 472 2 43 15 Yes No No South 721
25.974 2308 196 2 24 10 No No No West 0
17.316 1335 138 2 65 13 No No No East 0
49.794 5758 410 4 40 8 No No No South 734
12.096 4100 307 3 32 13 No No Yes South 560
13.364 3838 296 5 65 17 No No No East 480
57.872 4171 321 5 67 12 Yes No Yes South 138
37.728 2525 192 1 44 13 No No Yes South 0
18.701 5524 415 5 64 7 Yes No No West 966
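
When a qualitative variable is stored as a factor, `lm()` creates the dummy variables itself. The sketch below shows the encoding with `model.matrix()` on six rows copied from the table above; the name `credit_small` is made up for this example.

```r
# A few rows taken from the Credit table above
credit_small <- data.frame(
  Balance = c(333, 903, 580, 964, 331, 203),
  Income  = c(14.891, 106.025, 104.593, 148.924, 55.882, 20.996),
  Student = factor(c("No", "Yes", "No", "No", "No", "No")),
  Region  = factor(c("South", "West", "West", "West", "South", "East"))
)

# Design matrix that lm() would build: one dummy per non-baseline level
mm <- model.matrix(~ Student + Region, data = credit_small)
mm

# The same dummy for Student, created by hand (1 = student, 0 = not)
dummy_student <- as.integer(credit_small$Student == "Yes")
dummy_student
```

A two-level factor needs one dummy and a three-level factor needs two; the first level in alphabetical order ("No", "East") is the baseline absorbed by the intercept.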

Exercise

Use the Credit data from {ISLR2}, with Balance as the response, and create a model that you think is good. Try out qualitative variables as well. Begin by looking at all the variables in the data and think about which could affect the response.

Try the {broom} package as well.

Extending the linear model

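  • Additive property: the model assumes that the association between a predictor \(X_j\) and the response does not depend on the other predictors. (In practice, \(X_j\) may be associated with the response in a manner that changes with the values of other predictors.)
  • Linear property: the model assumes that a one-unit change in a predictor changes the response by the same amount, regardless of the value of the predictor.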

We will see some “classic” ways of relaxing these properties.

Breaking Additive Assumption by Interaction

Instead of \(Y = \beta_0 + \beta_1X_1 + \beta_2X_2\)

We add an interaction term for \(X_1\) and \(X_2\)

\[ \begin{aligned} Y & = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_1X_2\\ & = \beta_0 + X_1(\beta_1 + \beta_3X_2)+\beta_2X_2 \end{aligned} \]

What to do when the p-values of \(\beta_1\) and \(\beta_2\) are large and the p-value of the interaction term is small? The hierarchical principle says to keep the main effects in the model whenever their interaction is included.
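
In an R formula, `x1 * x2` expands to both main effects plus the interaction `x1:x2`, which respects the hierarchical principle. The sketch below fits the additive and interaction models on simulated data; the true coefficients and the helper `slope_x1_at()` are assumptions made for illustration.

```r
set.seed(3)                          # simulated data, for illustration only
n  <- 200
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 10)
y  <- 2 + 0.5 * x1 + 0.3 * x2 + 0.2 * x1 * x2 + rnorm(n)

additive         <- lm(y ~ x1 + x2)
with_interaction <- lm(y ~ x1 * x2)  # same as y ~ x1 + x2 + x1:x2

b <- coef(with_interaction)
b

# With an interaction, the slope of x1 is beta_1 + beta_3 * x2, so it changes with x2
slope_x1_at <- function(x2_value) b[["x1"]] + b[["x1:x2"]] * x2_value
slope_x1_at(c(0, 5, 10))

# F-test of the additive model against the interaction model
anova(additive, with_interaction)
```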

Exercise

Use the Credit data in the {ISLR2} library.

How does interaction work when the variables of interest are qualitative and quantitative (QL * QN), or both qualitative (QL * QL)?

Create a dummy for student.

Compare these two models:

\[Y = \beta_0 + \beta_1 income + \beta_2 dummy\_student + \beta_3(income \times dummy\_student)\]

and

\[Y = \beta_0 + \beta_1 income + \beta_2 dummy\_student\]

Polynomial Regression

\(mpg \approx \beta_0 + \beta_1 horsepower + \beta_2 horsepower^2\)

term          estimate      std.error     statistic  p.value
(Intercept)   56.900099702  1.8004268063   31.60367  1.740911e-109
horsepower    -0.466189630  0.0311246171  -14.97816  2.289429e-40
horsepower_2   0.001230536  0.0001220759   10.08009  2.196340e-21
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.688         0.686  4.37      428. 5.40e-99     2 -1133. 2274. 2290.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

\(mpg \approx \beta_0 + \beta_1 horsepower\)

term          estimate    std.error    statistic  p.value
(Intercept)   39.9358610  0.717498656   55.65984  1.220362e-187
horsepower    -0.1578447  0.006445501  -24.48914  7.031989e-81
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.606         0.605  4.91      600. 7.03e-81     1 -1179. 2363. 2375.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
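
A polynomial term can be added inside the formula with `I()`. The sketch below compares a linear and a quadratic fit on the built-in `mtcars` data, where `hp` plays the role of horsepower; this is a stand-in for the Auto data, so the numbers will not match the tables above.

```r
# mtcars is used as a stand-in so the example runs without extra packages
linear    <- lm(mpg ~ hp, data = mtcars)
quadratic <- lm(mpg ~ hp + I(hp^2), data = mtcars)  # I() protects ^ inside a formula

coef(quadratic)

# The quadratic model is still linear in its coefficients, so lm() fits it as usual
c(linear    = summary(linear)$r.squared,
  quadratic = summary(quadratic)$r.squared)

# F-test: does the squared term improve the fit?
anova(linear, quadratic)
```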

Potential Problems

  • Non-linearity of the response-predictor relationships.
  • Correlation of error terms.
  • Non-constant variance of error terms.
  • Outliers.
  • High-leverage points.
  • Collinearity.

Potential problems - Non linear Data

Plot the residuals against the fitted values. If there is an indication of a non-linear relationship, try non-linear transformations of your predictors, such as \(\log X\), \(\sqrt{X}\) or \(X^2\).

Potential Problems - Correlated Error Terms

  • We assume that \(\epsilon_1, \epsilon_2, \dots, \epsilon_n\) are uncorrelated. This means that knowing \(\epsilon_i\) tells us nothing about \(\epsilon_{i+1}\).
  • What if they are correlated?
  • Recall that \(SE\) is calculated based on this assumption.
  • If the error terms are correlated, the estimated standard errors tend to be too small, so we may end up trusting the model more than we should.
  • This is often seen in time-series data.
  • How to test: Durbin-Watson test, Ljung-Box Q test.
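
The Durbin-Watson statistic is simple enough to compute by hand: it is close to 2 when residuals are uncorrelated and falls toward 0 under positive autocorrelation. The sketch below uses simulated errors (independent versus AR(1) with coefficient 0.8, both assumptions for illustration) and a small helper `dw()` written for this example; `Box.test()` from base R gives the Ljung-Box test.

```r
set.seed(4)                          # simulated data, for illustration only
n <- 200
x <- seq_len(n)

# Independent errors versus AR(1) errors: e_i = 0.8 * e_{i-1} + noise
e_indep <- rnorm(n)
e_ar    <- numeric(n)
e_ar[1] <- rnorm(1)
for (i in 2:n) e_ar[i] <- 0.8 * e_ar[i - 1] + rnorm(1)

fit_indep <- lm(I(1 + 0.5 * x + e_indep) ~ x)
fit_ar    <- lm(I(1 + 0.5 * x + e_ar) ~ x)

# Durbin-Watson statistic: sum of squared successive differences over RSS
dw <- function(fit) {
  e <- resid(fit)
  sum(diff(e)^2) / sum(e^2)
}
c(independent = dw(fit_indep), autocorrelated = dw(fit_ar))

# Ljung-Box test on the residuals of the autocorrelated fit
lb <- Box.test(resid(fit_ar), lag = 5, type = "Ljung-Box")
lb$p.value
```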

Potential Problems - Non-Constant Variance of error term

  • This means that \(Var(\epsilon_i)\) could increase or decrease with the response.
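  • If this is the case, the standard errors, confidence intervals and hypothesis tests, which all rely on constant variance, can be misleading.
  • Can be identified by a funnel shape in the residuals vs fitted plot.
  • To address this, transform the response with a concave function, e.g. \(\log Y\) or \(\sqrt{Y}\).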

Potential Problem - Outlier

An outlier is a point for which \(Y_i\) is far from the value predicted by the model.

  • If the predictor value for an outlier is usual, it has little effect on the fit, but it can affect the RSE, p-values and \(R^2\).
  • We can use the residuals vs fitted plot to identify an outlier. But the question remains: how high is high?
  • To answer that, plot the studentized residuals, which MASS::studres(model) computes. Observations whose studentized residuals are larger than 3 in absolute value, i.e. outside -3 to 3, are possible outliers.
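
The rule of thumb above can be tried on simulated data with one planted outlier. The sketch uses base R's `rstudent()`, which should give the same kind of studentized residual as `MASS::studres()`; the data and the position of the outlier are assumptions for illustration.

```r
set.seed(5)                          # simulated data, for illustration only
n <- 50
x <- runif(n, 0, 10)
y <- 1 + 2 * x + rnorm(n)
y[25] <- y[25] + 10                  # plant one outlier at observation 25

fit <- lm(y ~ x)

# Studentized residuals: each residual divided by its estimated standard error,
# with the error variance estimated without that observation
stud <- rstudent(fit)

# Flag observations outside the -3 to 3 band
flagged <- which(abs(stud) > 3)
flagged
stud[flagged]
```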

Potential Problems - High leverage

Observations with an unusual value of the predictor \(x_i\) are said to have high leverage.

  • High leverage points have a large impact on the fitted line.
  • A point can have high leverage even if its value is usual for each individual predictor but unusual for the set of predictors taken together.
  • The leverage statistic quantifies this (shown here for one predictor).

\[h_i = \frac{1}{n}+\frac{(x_i - \bar{x})^2}{\sum_{i'=1}^{n}(x_{i'} - \bar{x})^2}\]

  • \(h_i\) is between \(\frac{1}{n}\) and \(1\)
  • “the average leverage for all the observations is always equal to \(\frac{(p+1)}{n}\).”
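
These properties can be checked numerically. The sketch below computes leverage with base R's `hatvalues()` and with the one-predictor formula, using simulated data in which the last observation has an unusually large \(x\) (an assumption for illustration).

```r
set.seed(6)                          # simulated data, for illustration only
n <- 40
x <- c(runif(n - 1, 0, 10), 30)      # the last x is far from the rest
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

# Leverage from the fitted model and from the formula for one predictor
h        <- hatvalues(fit)
h_manual <- 1 / n + (x - mean(x))^2 / sum((x - mean(x))^2)

range(h)                             # every h_i lies between 1/n and 1
mean(h)                              # equals (p + 1) / n = 2 / 40
which.max(h)                         # the observation with the unusual x
```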

Potential Problems - Collinearity

  • When two variables are collinear, it is difficult to say how each predictor is individually associated with the response.
  • Look at the correlation matrix for all variables. (Not a catch-all solution: multicollinearity among three or more variables can exist even when no pair is highly correlated.)
  • The variance inflation factor (VIF) is the ratio of the variance of \(\hat\beta_j\) when fitting the full model to the variance of \(\hat\beta_j\) if fit on its own.
  • The VIF for an individual predictor can be computed by:

\[VIF(\hat\beta_j) = \frac{1}{1-R_{X_j|X_{-j}}^2}\]

  • Two ways to deal with this: drop predictors with high VIF, or combine collinear predictors into a single one.
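
The VIF formula needs only the \(R^2\) from regressing each predictor on the others. The sketch below computes it by hand for three predictors from the built-in `mtcars` data; the dataset and the choice of predictors are assumptions for illustration.

```r
# Three predictors that are known to move together in mtcars
predictors <- c("wt", "hp", "disp")

# For each predictor: regress it on the others and apply 1 / (1 - R^2)
vif_by_hand <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = mtcars))$r.squared
  1 / (1 - r2)
})
vif_by_hand

# Pairwise correlations for comparison
cor(mtcars[predictors])
```

A VIF of 1 means no collinearity at all; larger values mean the variance of \(\hat\beta_j\) is inflated by that factor.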